Sentiment Analysis with Python

What is Sentiment Analysis?

Sentiment analysis is the automated interpretation and classification of emotions (usually positive, negative, or neutral) from textual data such as written reviews and social media posts.

In this notebook, we will use the Kaggle dataset "Amazon Fine Food Reviews" to determine whether a review is positive or negative.

The columns we will use most in this analysis are “Summary”, “Text”, and “Score”.

"Text": the full text of the product review.

"Summary": a short summary of the review.

"Score": the product rating provided by the customer.

"Score"

The rating provided by the customer on a scale of 1 to 5, where 5 is the most positive and 1 is the most negative.

We can see that most of the reviews are positive based on the large number of reviews with a score of 4 or more.
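The score distribution can be checked with a simple count per rating. The snippet below is a minimal sketch on a made-up data frame; the toy values are an assumption for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical values, not the real dataset).
df = pd.DataFrame({"Score": [5, 4, 5, 1, 3, 5, 2, 4]})

# Number of reviews for each score, ordered from 1 to 5.
score_counts = df["Score"].value_counts().sort_index()
print(score_counts)

# Fraction of reviews with a score of 4 or more.
share_high = (df["Score"] >= 4).mean()
print(f"Share of reviews scoring 4 or 5: {share_high:.0%}")
```

On the real data, the same two lines would be applied to the data frame loaded from the Kaggle file.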

Classifying reviews

We will label each review as “positive” or “negative,” so that we can use the labels as training data for our sentiment classification model.

Positive reviews will be classified as +1, and negative reviews will be classified as -1.

We will classify all reviews with ‘Score’ > 3 as +1, indicating that they are positive.

All reviews with ‘Score’ < 3 will be classified as -1. Reviews with ‘Score’ = 3 will be dropped, because they are neutral. This model will only classify positive and negative reviews.
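The labeling rule above can be sketched in a few lines of pandas. The toy rows and the column name `sentiment` are assumptions made for this illustration; the notebook's own column name may differ.

```python
import pandas as pd

# Toy reviews (hypothetical); the real data frame has the same "Score" column.
df = pd.DataFrame({
    "Text": ["Great taste", "Stale and bland", "It was okay", "Love it", "Awful"],
    "Score": [5, 1, 3, 4, 2],
})

# Drop neutral reviews (Score == 3).
df = df[df["Score"] != 3].copy()

# Label the rest: +1 when Score > 3, -1 when Score < 3.
# "sentiment" is a hypothetical column name chosen for this sketch.
df["sentiment"] = df["Score"].apply(lambda s: 1 if s > 3 else -1)
print(df)
```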

Building the sentiment analysis model

The model takes a review as input and predicts whether the review is positive or negative.

This is a classification task, so we will train a simple logistic regression model to do it.

Step 1: Data Cleaning

Now, the data frame will be split into train and test sets.

80% of the data will be used for training, and 20% will be used for testing.
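One common way to get an 80/20 split is scikit-learn's `train_test_split`. The sketch below assumes a data frame with `Summary` and `sentiment` columns (the latter is a hypothetical label column) and fixes `random_state` only so the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data frame with 10 rows (hypothetical content).
df = pd.DataFrame({
    "Summary": [f"review {i}" for i in range(10)],
    "sentiment": [1, -1] * 5,
})

# Hold out 20% of the rows for testing; the remaining 80% are used for training.
train, test = train_test_split(df, test_size=0.2, random_state=0)
print(len(train), len(test))  # 8 2
```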

Next, we will use a count vectorizer from the scikit-learn library to transform the text in our data frame into a bag-of-words model.

A bag-of-words model represents each document as a vector of word counts: it builds a vocabulary from the words in the corpus and records how many times each vocabulary word occurs in the document, ignoring word order.

We need to convert the text into a bag-of-words model because the logistic regression algorithm requires numeric input and cannot work with raw text.
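To make the bag-of-words idea concrete, here is a small sketch with `CountVectorizer` on three made-up documents. The documents are assumptions for illustration; the vocabulary learned from the real reviews will be far larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents (hypothetical).
docs = ["good coffee", "bad coffee", "good good tea"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)        # ['bad', 'coffee', 'good', 'tea']
print(X.toarray())  # rows: [0 1 1 0], [1 1 0 0], [0 0 2 1]
```

Each column counts one vocabulary word, so the third document has a 2 in the "good" column.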

Testing the model

The overall accuracy of the model on the test data is around 93%, which is pretty good considering we did little preprocessing and no feature engineering beyond the bag-of-words counts.
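The train-and-test steps can be sketched end to end as follows. This is a minimal illustration on made-up sentences with default `LogisticRegression` settings, both of which are assumptions; it does not reproduce the accuracy figure reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy training and test data (hypothetical), labeled +1 / -1.
train_text = ["great product", "love it", "terrible taste",
              "awful waste", "really great", "really awful"]
train_y = [1, 1, -1, -1, 1, -1]
test_text = ["great taste love it", "terrible awful waste"]
test_y = [1, -1]

# Learn the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Fit the classifier on word counts and score it on the held-out examples.
model = LogisticRegression()
model.fit(X_train, train_y)
predictions = model.predict(X_test)

accuracy = accuracy_score(test_y, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Fitting the vectorizer on the training set alone keeps test-set words from leaking into the vocabulary.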